
feat(install): slim install for remote/NIM-only inference on Intel Mac#1830

Open
charlesbluca wants to merge 30 commits into NVIDIA:main from charlesbluca:slim-install

Conversation

@charlesbluca
Collaborator

@charlesbluca charlesbluca commented Apr 9, 2026

Description

This branch streamlines the default NeMo Retriever install and dependency story, and tightens optional vs core packages in nemo_retriever/pyproject.toml.

Dependency and install model

  • Core dependencies focus on orchestration (Ray), ingest API/client, HTTP stack, LanceDB, PDF/image/NIM-related libs, and shared utilities. nv-ingest is no longer a direct dependency of nemo-retriever; editable installs still wire nv-ingest-api and nv-ingest-client from the monorepo paths.
  • New optional groups: [local] (GPU-oriented torch, transformers, Nemotron model packages, vLLM on Linux, etc.), [multimedia] (audio/ASR + SVG via cairosvg), [stores] (DuckDB and Neo4j), [benchmarks] (datasets / open-clip), and [all] as a convenience meta-extra.
  • Torch / torchvision resolution: CUDA 13 wheel index on Linux and Windows; macOS falls back to PyPI CPU wheels. The renamed UV index (pytorch-cu130) matches that layout.
  • uv.lock is regenerated to match the slimmer dependency graph.

Docs

  • nemo_retriever/README.md: clearer split between remote NIM-only vs local GPU setups; documents [local] and the CUDA 13 torch override; SVG install now points at [multimedia] (with a direct cairosvg alternative).

Runtime and API behavior

  • Lazy / guarded imports and small compatibility fixes (e.g. Ray 2.49) across ingest API utilities and retriever modules so optional stacks fail more gracefully when extras are not installed.
  • Text embeddings: CPU embedding path supports remote HTTP when an endpoint URL is set, otherwise local Hugging Face on CPU; shared URL normalization avoids doubling /embeddings on HTTP embedding URLs (delegates to nv_ingest_api helper). Tests aligned with the local-HF default when no remote URL is configured.
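The URL normalization mentioned above can be sketched as an idempotent helper. This is a hypothetical simplification (`ensure_embeddings_suffix` is an illustrative name; the real logic lives in the `nv_ingest_api` `ensure_openai_embeddings_http_url` helper, whose exact signature may differ):

```python
def ensure_embeddings_suffix(url: str, suffix: str = "/embeddings") -> str:
    """Append the embeddings path segment exactly once (idempotent)."""
    trimmed = url.rstrip("/")
    if trimmed.endswith(suffix):
        return trimmed  # already normalized; avoid /embeddings/embeddings
    return trimmed + suffix
```

Because the helper is idempotent, callers can apply it unconditionally to both bare base URLs and already-normalized endpoint URLs.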

CI

  • Unit tests install nemo_retriever[all,dev] instead of separately installing editable src/api/client packages plus nemo_retriever.
  • New scheduled / dispatch workflow integration-test-library-mode.yml: library-mode graph pipeline on Windows and macOS (ARM and Intel), using NVCF_API_KEY for remote NIM endpoints.

Misc

  • Removes obsolete agent documentation; assorted small fixes (e.g. Milvus client usage, executor/ocr/pdf paths) driven by the slimmer dependency surface.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, I have ensured those are mirrored in the Helm values.yaml file.

charlesbluca and others added 16 commits April 8, 2026 17:56
…d macOS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
macOS CI runners (Apple Silicon) can be slow to start the Ray raylet,
causing the node to time out during startup. Setting the wait to 120s
prevents premature failures on resource-constrained runners.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Guard cv2, sklearn, langdetect, minio, pymilvus, and unstructured_client
behind try/except or deferred imports so nemo_retriever and nv_ingest_api
can be imported without GPU/local-inference packages installed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
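The guard described above follows a common optional-import pattern: the heavy package resolves to a None sentinel at import time, and call sites branch on availability. A minimal sketch, using langdetect as the example (`detect_language` is an illustrative helper, not the repository's API):

```python
# Optional packages resolve to a None sentinel so the module imports
# cleanly in a slim install that lacks the extras.
try:
    import langdetect  # optional extra; absent without [local]/[all]
except ImportError:
    langdetect = None


def detect_language(text: str) -> str:
    """Best-effort language detection; degrade gracefully without extras."""
    if langdetect is None:
        return "unknown"  # slim-install fallback path
    return langdetect.detect(text)
```

The same shape applies to cv2, sklearn, minio, pymilvus, and unstructured_client: the fallback branch either degrades (as here) or raises a descriptive error at call time rather than at import time.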
- pdfium.py: numpy channel-swap fallback when cv2 is None
- extract.py: PIL encode fallback for page image rendering without cv2
- transforms.py: PIL fallbacks for numpy_to_base64, base64_to_numpy
- table_and_chart.py (api + nemo_retriever): guard bare sklearn DBSCAN
  imports with ImportError fallback using naive y-grid row grouping
- ocr/shared.py: same guard for _blocks_to_pseudo_markdown DBSCAN
- lancedb_store.py: skip index creation when no embedding rows written
- text_embed/runtime.py: raise on embedding failure instead of silent no-op
- graph/executor.py: per-stage debug logging for inprocess pipeline
- text_embed/cpu_operator.py: debug logging for embed input/output stats

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Windows also supports CUDA torch wheels; extend the platform marker to
sys_platform == 'linux' or sys_platform == 'win32' so Windows users
get the GPU-enabled wheel from the PyTorch CUDA index rather than
falling through to the CPU-only PyPI wheel.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@charlesbluca charlesbluca requested review from a team as code owners April 9, 2026 15:15
@charlesbluca charlesbluca requested a review from nkmcalli April 9, 2026 15:15
@greptile-apps
Contributor

greptile-apps bot commented Apr 9, 2026

Greptile Summary

This PR streamlines the nemo_retriever install surface by splitting heavy GPU/ML deps into optional groups ([local], [multimedia], [stores], [benchmarks]), adds PIL fallbacks for every cv2-dependent code path so that the slim [remote] install works on Intel Mac, normalizes the embedding URL construction via a shared ensure_openai_embeddings_http_url helper, and fixes several usability issues (Ray 2.49 concurrency requirement, docstring placement, empty-file guard in text splitting). All previously identified P0/P1 concerns from prior review threads have been addressed.

Confidence Score: 5/5

Safe to merge; all previously identified P0/P1 concerns are addressed and only P2 style/edge-case findings remain.

The milvus.py None sentinel fix, _resize_image_opencv cv2 crash fix, and graceful LanceDB core-dep handling resolve the main prior concerns. New findings are a PIL RGBA compositing mismatch, a dead except ImportError, inline logging import, and a missing empty-content guard in txt_bytes_to_chunks_df — all P2 and none blocking.

api/src/nv_ingest_api/util/image_processing/transforms.py (PIL RGBA fallback), nemo_retriever/src/nemo_retriever/txt/split.py (empty-bytes guard)

Important Files Changed

Filename Overview
api/src/nv_ingest_api/util/image_processing/transforms.py Adds PIL fallbacks for every cv2 code path; minor RGBA compositing mismatch and a vestigial except ImportError remain.
nemo_retriever/src/nemo_retriever/text_embed/runtime.py Converts print/BaseException error handler to proper logger.error/Exception with graceful DataFrame return; correct and intentional behavior change.
nemo_retriever/src/nemo_retriever/text_embed/cpu_operator.py Adds local HF embedder path when no remote endpoint is configured; logic and lazy import pattern are correct.
nemo_retriever/src/nemo_retriever/graph/executor.py Adds Ray 2.49 concurrency=1 compatibility fix and per-stage debug logging; logging import placed inside method body rather than at module level.
nemo_retriever/pyproject.toml Moves GPU/ML deps to [local], audio/SVG to [multimedia], SQL/graph stores to [stores], benchmarks to [benchmarks]; lancedb stays in core; torch index renamed to pytorch-cu130 with correct platform markers.
api/src/nv_ingest_api/util/string_processing/init.py New ensure_openai_embeddings_http_url helper is idempotent and correctly prevents /embeddings/embeddings doubling.
.github/workflows/integration-test-library-mode.yml New nightly integration test for Windows/macOS ARM/Intel; third-party actions still on mutable version tags (previously noted), and no permissions: block is declared.
nemo_retriever/src/nemo_retriever/vector_store/lancedb_store.py Removes redundant try/except around import lancedb inside _write_rows_to_lancedb; adds empty-rows guard in handle_lancedb; import lancedb at module level is appropriate since lancedb is a core dependency.
nemo_retriever/src/nemo_retriever/txt/split.py Fixes docstring placement and adds empty-content early-return in txt_file_to_chunks_df; txt_bytes_to_chunks_df is missing the same guard.

Sequence Diagram

sequenceDiagram
    participant GP as graph_pipeline
    participant CPU as _BatchEmbedCPUActor
    participant RT as embed_text_main_text_embed
    participant HF as create_local_embedder
    participant NIM as Remote NIM /embeddings

    GP->>CPU: __init__(EmbedParams)
    alt embed_invoke_url set
        CPU->>CPU: self._model = None
        Note over CPU: Remote path
    else no endpoint
        CPU->>HF: create_local_embedder(model_name, device=cpu)
        HF-->>CPU: local HF model
        Note over CPU: Local CPU path
    end

    GP->>CPU: process(df)
    CPU->>RT: embed_text_main_text_embed(df, model=self._model, ...)
    alt model is not None (local)
        RT->>HF: model.embed(texts)
        HF-->>RT: vectors
    else endpoint is set (remote)
        RT->>NIM: POST /embeddings (ensure_openai_embeddings_http_url)
        NIM-->>RT: embeddings JSON
    end
    alt success
        RT-->>CPU: out_df (has_embedding=True)
    else Exception
        RT->>RT: logger.error(..., exc_info=True)
        RT-->>CPU: out_df (has_embedding=False, error=str(exc))
    end
    CPU-->>GP: out_df
Prompt To Fix All With AI
This is a comment left during a code review.
Path: api/src/nv_ingest_api/util/image_processing/transforms.py
Line: 697-718

Comment:
**PIL RGBA fallback composites on black, cv2 path uses white**

When cv2 is unavailable, `Image.open(BytesIO(image_bytes)).convert("RGB")` composites alpha onto a black background (PIL's default), while the cv2 path calls `rgba_to_rgb_white_bg` which blends against white. For document images (which are typically white-background), this produces visually different results in the cv2-free path.

```suggestion
        else:
            from io import BytesIO

            pil_img = Image.open(BytesIO(image_bytes))
            if pil_img.mode == "RGBA":
                background = Image.new("RGB", pil_img.size, (255, 255, 255))
                background.paste(pil_img, mask=pil_img.split()[3])
                pil_img = background
            else:
                pil_img = pil_img.convert("RGB")
            img = np.array(pil_img)
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: api/src/nv_ingest_api/util/image_processing/transforms.py
Line: 714-715

Comment:
**Vestigial `except ImportError: raise`**

After the refactoring the only potential `ImportError` source inside this `try` block is `from io import BytesIO`, which is a stdlib module that cannot fail. The clause is dead code and may confuse future readers.

```suggestion
    except Exception as e:
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/graph/executor.py
Line: 131-133

Comment:
**Module-level logger preferred over inline import**

`import logging as _logging` and the `_exec_log = ...` call are placed inside `execute()` on every invocation. The standard pattern is a single module-level `logger = logging.getLogger(__name__)`. If the goal was to avoid conflicts with an existing name, the existing module-level logger can simply be reused.

Move `import logging` and `logger = logging.getLogger(__name__)` to module scope alongside the existing imports, then replace `_exec_log` references with `logger`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/txt/split.py
Line: 217-220

Comment:
**`txt_bytes_to_chunks_df` missing empty-content guard**

`txt_file_to_chunks_df` received an early return for empty/whitespace-only content (returning a correctly-typed empty DataFrame). `txt_bytes_to_chunks_df` did not get the same treatment. With empty bytes, `raw` will be `""`, `split_text_by_tokens` may return `[]`, and `pd.DataFrame([])` produces a 0-column DataFrame — downstream callers that access `df["text"]` would raise `KeyError`.

```python
raw = content_bytes.decode(encoding, errors="replace")
if not raw or not raw.strip():
    return pd.DataFrame(
        columns=["text", "path", "page_number", "metadata"],
    ).astype({"page_number": "int64"})
model_id = tokenizer_model_id or DEFAULT_TOKENIZER_MODEL_ID
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: .github/workflows/integration-test-library-mode.yml
Line: 1-10

Comment:
**Missing `permissions:` block**

Per the `github-actions-security` rule, new workflows should declare an explicit least-privilege `permissions:` block. Without it the job inherits the repository's default token permissions. A read-only scope is sufficient for this checkout-and-test workflow:

```yaml
permissions:
  contents: read
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (5): Last reviewed commit: "test(embed): align BatchEmbedCPUActor te..."

Comment on lines +47 to +54

      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install uv
        run: pip install uv
Contributor


P2 GitHub Actions not pinned to commit SHA

Both actions/checkout@v4 and actions/setup-python@v5 use mutable version tags. Per the repository's github-actions-security rule, third-party actions must be pinned to a full commit SHA to prevent supply-chain attacks.

Suggested change

      - name: Check out repository code
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683  # v4.2.2
        with:
          ref: ${{ inputs.source-ref != '' && inputs.source-ref || github.ref }}

      - name: Set up Python 3.12
        uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5.6.0
        with:
          python-version: '3.12'
Prompt To Fix With AI
This is a comment left during a code review.
Path: .github/workflows/integration-test-library-mode.yml
Line: 47-54

Comment:
**GitHub Actions not pinned to commit SHA**

Both `actions/checkout@v4` and `actions/setup-python@v5` use mutable version tags. Per the repository's `github-actions-security` rule, third-party actions must be pinned to a full commit SHA to prevent supply-chain attacks.

```suggestion
      - name: Check out repository code
        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683  # v4.2.2
        with:
          ref: ${{ inputs.source-ref != '' && inputs.source-ref || github.ref }}

      - name: Set up Python 3.12
        uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5.6.0
        with:
          python-version: '3.12'
```

How can I resolve this? If you propose a fix, please make it concise.

@charlesbluca charlesbluca marked this pull request as draft April 9, 2026 15:33
charlesbluca and others added 5 commits April 9, 2026 15:53
- Wrap bare `import lancedb` in try/except in lancedb_store.py; add None
  guard in handle_lancedb for [remote]-only installs
- Make graph_pipeline.py import of handle_lancedb lazy to avoid startup
  crash when lancedb is not installed
- Add None sentinels for all optional pymilvus/minio/scipy imports in
  milvus.py (previously only CONSISTENCY_BOUNDED was set)
- Guard cv2.INTER_LANCZOS4 and cv2.resize in _resize_image_opencv with
  PIL fallback when cv2 is unavailable
- Restore graceful degradation in embed_text_main_text_embed: return
  error-payload DataFrame instead of re-raising on embedding failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…izer

- Restore handle_lancedb as module-level import in graph_pipeline.py so
  monkeypatch.setattr works in batch_pipeline tests (safe now that
  lancedb_store.py guards its own import lancedb)
- Early-return in txt_file_to_chunks_df before calling tokenizer when
  file content is empty, matching the pattern in html/convert.py
- Mock _get_tokenizer/_get_txt_tokenizer in txt and html unit tests to
  avoid hitting HuggingFace (proxy-blocked in this environment)
- Skip audio benchmark test gracefully when Ray cannot write to the
  filesystem (EROFS) rather than failing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move docstrings in txt_file_to_chunks_df and txt_bytes_to_chunks_df
  to immediately follow the def line so they are recognized as actual
  docstrings (previously appeared after variable assignments)
- Remove conflicts block in pyproject.toml that referenced undefined
  cpu/cu130 extras; torch wheel selection is already handled by
  sys_platform markers in [tool.uv.sources]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Merge remote extra into base dependencies (document parsing and NIM
  client libs are now core; no extra needed for remote-only installs)
- Move lancedb into base dependencies (it is the default VDB solution)
- Merge local-cpu and local-gpu into a single local extra (GPU assumed)
- Drop lancedb from [stores]; update [all] to reference [local]
- Remove now-unnecessary lancedb import guards in lancedb_store.py
- Update integration-test workflow to install bare nemo_retriever
- Update DEPENDENCY_LAYERS.md to reflect new structure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
charlesbluca and others added 7 commits April 9, 2026 16:52
- Replace core-only install with [local] extra for local GPU path; add
  separate remote-only install command for NIM inference users
- Reframe Step 2 torch override: clarifies it is a PyPI/CUDA workaround,
  not a completion of the GPU stack; mark it skip-able for remote users
- Add note before pipeline examples that they require [local] + CUDA torch
- Fix [svg] extra reference to [multimedia]

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…workflow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Made-with: Cursor

# Conflicts:
#	nemo_retriever/pyproject.toml
- Add ensure_openai_embeddings_http_url in nv-ingest string_processing

- Use it in infer_microservice (api + client) and retriever main_text_embed

- CPU embed actor: local HF when no remote endpoint instead of default integrate URL

Made-with: Cursor
After the embed URL fix, the CPU actor uses a local embedder when no
HTTP endpoint is set. Update the unit test to assert that behavior and
mock create_local_embedder so the suite does not require HF revisions.

Made-with: Cursor
@charlesbluca charlesbluca marked this pull request as ready for review April 13, 2026 16:52